Welcome everybody to deep learning. Today we want to look into further common practices, and in particular in this video we want to discuss architecture selection and hyperparameter optimization.
Remember: the test data is still in the vault; we are not touching it.
However, we need to set our hyperparameters somehow, and you have already seen that there is an enormous number of them. For the architecture: the number of layers, the number of nodes per layer, and the activation functions. For the optimization: the initialization, the loss function, the optimizer, such as stochastic gradient descent, momentum, or Adam, the learning rate and its decay, and the batch size. For regularization: the different regularizers such as the L2 and L1 penalties, batch normalization, dropout and so on.
And you want to somehow figure out all the parameters for those different kinds of procedures.
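To make this concrete, here is a minimal sketch of how such a search space might be written down; every name and value range below is an illustrative choice, not a recommendation from the lecture.

```python
# Illustrative hyperparameter search space; all names and ranges
# are example choices for a generic classification setup.
search_space = {
    # architecture
    "num_layers":      [2, 4, 8],
    "nodes_per_layer": [64, 128, 256],
    "activation":      ["relu", "tanh"],
    # optimization
    "optimizer":       ["sgd", "momentum", "adam"],
    "learning_rate":   [1e-1, 1e-2, 1e-3],  # on a log scale
    "lr_decay":        [0.0, 1e-2, 1e-1],
    "batch_size":      [32, 64, 128],
    # regularization
    "l2_weight":       [0.0, 1e-4, 1e-3],
    "dropout":         [0.0, 0.25, 0.5],
    "batch_norm":      [True, False],
}
```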
Now let's choose the architecture and loss function. The first step is to think about the problem and the data: What could the features look like? What kind of spatial correlation do you expect? What data augmentation makes sense? How will the classes be distributed? What is important regarding the target application? Then start with simple architectures and loss functions, and of course do your research. Try well-known models first and foremost: there are so many published papers out there that there is no need to do everything by yourself. One day in the library can save weeks and months of experimentation. Do the research, it will really save you time. The very good papers don't just report the scientific results; they also share source code, sometimes even data. Try to find those papers; they can help you a lot with your own experimentation.
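As an illustration of "start simple", here is a small baseline one might begin with; the framework (PyTorch), layer sizes, and loss are example choices for a generic classification problem, not prescriptions from the lecture.

```python
import torch.nn as nn

# A deliberately simple baseline: a small fully connected network.
# All sizes are example choices (e.g., MNIST-shaped inputs, 10 classes).
baseline = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),  # input features -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),       # hidden layer -> class scores
)
loss_fn = nn.CrossEntropyLoss()
```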
Then you may want to adapt an architecture found in the literature, and if you change something, find good reasons why it is an appropriate change. There are quite a few papers out there that seem to introduce random changes into the architecture, and it later turns out that the observations they made were essentially random: the authors were just lucky, or experimented enough on their own data to get the improvements. Typically there should also be a reasonable argument for why a specific change gives an improvement in performance.
Next, you want to do your hyperparameter search. Remember: learning rate, decay, regularization, dropout and so on all have to be tuned, yet the networks can take days or weeks to train. For searching these hyperparameters we recommend using a log scale; for example, for the learning rate η you would try 0.1, 0.01, 0.001. You may want to consider grid search or random search. In grid search you take equally spaced steps, but reference [2] has shown that random search has real advantages over grid search: first, it is easier to implement, and second, it explores the parameters that have a strong influence on the result better.
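As a sketch, random search over a log-scaled learning rate might look like the following; `train_and_validate` is a hypothetical placeholder for your own training routine.

```python
import random

def train_and_validate(config):
    # Hypothetical placeholder: train for a few epochs with `config`
    # and return the validation accuracy. Dummy value for illustration.
    return random.random()

def sample_config():
    """Draw one random configuration; the learning rate is sampled
    log-uniformly, as recommended above."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),  # 1e-4 .. 1e-1
        "dropout":       random.uniform(0.0, 0.5),
        "batch_size":    random.choice([32, 64, 128]),
    }

best_config, best_score = None, float("-inf")
for _ in range(20):                    # e.g., 20 random trials
    config = sample_config()
    score = train_and_validate(config)
    if score > best_score:
        best_config, best_score = config, score
print(best_config, best_score)
```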
So you may want to look into that and then adjust your strategy accordingly. Hyperparameters are highly interdependent, so you may want to use a coarse-to-fine search: you optimize on a very coarse scale in the beginning and then make the search finer and finer. You may train the network for only a few epochs per trial at first, bring all the hyperparameters into sensible ranges, and then refine using random or grid search, as in the sketch below.
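A minimal sketch of this coarse-to-fine idea, again with a hypothetical `train_and_validate` placeholder: first a wide, cheap search with few epochs per trial, then a narrow search around the best coarse result with longer training.

```python
import math
import random

def train_and_validate(lr, epochs):
    # Hypothetical placeholder: train for `epochs` epochs at learning
    # rate `lr` and return the validation accuracy. Dummy value here.
    return random.random()

def random_lr_search(lo_exp, hi_exp, trials, epochs):
    """Random search over learning rates drawn log-uniformly
    from 10**lo_exp to 10**hi_exp."""
    best_lr, best_score = None, float("-inf")
    for _ in range(trials):
        lr = 10 ** random.uniform(lo_exp, hi_exp)
        score = train_and_validate(lr, epochs)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr

# Coarse stage: wide range, only a few epochs per trial.
coarse_lr = random_lr_search(lo_exp=-6, hi_exp=-1, trials=30, epochs=2)

# Fine stage: narrow range around the coarse optimum, longer training.
center = math.log10(coarse_lr)
fine_lr = random_lr_search(lo_exp=center - 0.5, hi_exp=center + 0.5,
                           trials=10, epochs=20)
```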
A very common technique that can give you a little extra bit of performance is ensembling; this can really help you get that additional boost you still need. So far we have only considered a single classifier, but the idea of ensembling is to use many of those classifiers together. If you assume $n$ independent classifiers, each making a correct prediction with probability $1 - p$, then the probability of seeing exactly $k$ errors is $\binom{n}{k} p^k (1 - p)^{n - k}$, i.e., a binomial distribution. So the probability of the majority, meaning more than $n/2$ of the classifiers, being wrong is $\sum_{k > n/2} \binom{n}{k} p^k (1 - p)^{n - k}$. We visualize this in the following plot: if you take more of those weak classifiers and set, for example, their individual error probability $p$ below one half, the probability that the majority is wrong shrinks rapidly as $n$ grows.
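To reproduce the numbers behind such a plot, here is a small sketch that evaluates the majority-vote error for a growing ensemble; the per-classifier error $p = 0.3$ is just an example value.

```python
from math import comb

def majority_error(n, p):
    """Probability that more than half of n independent classifiers,
    each wrong with probability p, err at the same time."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Example: each classifier is wrong 30% of the time (p = 0.3).
for n in [1, 5, 11, 21, 51]:
    print(f"n = {n:2d}: P(majority wrong) = {majority_error(n, 0.3):.4f}")
```

With $p < 0.5$, this probability falls quickly as $n$ grows, which is exactly the effect the plot illustrates.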